{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _url_userguide:\n",
    "\n",
    "URLs\n",
    "===="
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "--------------\n",
    "\n",
    "The function :func:`clean_url() <dataprep.clean.clean_url.clean_url>` cleans a DataFrame column containing URLs and extracts the important parameters, including the cleaned path, queries, and scheme. The function :func:`validate_url() <dataprep.clean.clean_url.validate_url>` validates either a single URL or a column of URLs, returning True if the value is valid and False otherwise.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`clean_url()` extracts the important features of each URL and creates an additional column containing key-value pairs of these features. It extracts the following features:\n",
    "\n",
    "* scheme (string)\n",
    "* host (string)\n",
    "* cleaned path (string)\n",
    "* queries (key-value pairs)\n",
    "\n",
    "Remove authentication tokens: URLs often contain sensitive information such as access tokens or user information. `clean_url()` can remove this information via the `remove_auth` parameter. The usage of all parameters is explained in depth in the sections below.\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception\n",
    "\n",
    "After cleaning, a **report** is printed that provides the following information:\n",
    "\n",
    "* How many values were cleaned (the value must have been transformed).\n",
    "* How many values could not be parsed.\n",
    "* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_url()` and `validate_url()`. "
   ]
  },
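  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the feature list above concrete, the following minimal sketch splits one URL into the same kinds of pieces using Python's standard `urllib.parse` module. It is only an illustration of the idea, not how `clean_url()` is implemented, and the exact output format of `clean_url()` may differ. The names `sample_url` and `parts` are chosen for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlsplit, parse_qsl\n",
    "\n",
    "# Illustration only: standard-library parsing, not dataprep's implementation\n",
    "sample_url = \"https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van\"\n",
    "parts = urlsplit(sample_url)\n",
    "\n",
    "{\n",
    "    \"scheme\": parts.scheme,                   # protocol part of the URL\n",
    "    \"host\": parts.hostname,                   # domain name\n",
    "    \"path\": parts.path,                       # path after the host\n",
    "    \"queries\": dict(parse_qsl(parts.query)),  # query string as key-value pairs\n",
    "}"
   ]
  },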
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame({\n",
    "    \"url\": [\n",
    "        \"random text which is not a url\",\n",
    "        \"http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423\",\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van\",\n",
    "        \"notaurl\",\n",
    "        np.nan,\n",
    "        None,\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur\",\n",
    "        \"\",\n",
    "        {\n",
    "            \"not_a_url\": True\n",
    "        },\n",
    "        \"2345678\",\n",
    "        345345345,\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur\",\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van\",\n",
    "    ]\n",
    "})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default `clean_url()`\n",
    "By default, the parameters are set to `inplace = False`, `split = False`, `remove_auth = False`, `report = True`, and `errors = \"coerce\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_url\n",
    "df_default = clean_url(df, column=\"url\")\n",
    "df_default"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The new DataFrame `df_default` contains an additional column, `url_details`. Its name follows the convention `original_column_name`**_details** (`url_details` in our case).\n",
    "\n",
    "Now let us see what one of the cells in `url_details` looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_default[\"url_details\"][1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. `remove_auth` parameter\n",
    "\n",
    "Sometimes we need to remove sensitive information when parsing a URL. In `clean_url()` we can do this by setting the `remove_auth` parameter to `True`, or by passing a list of query parameters to remove. Hence `remove_auth` can be a boolean value or a list of strings.\n",
    "\n",
    "When `remove_auth` is set to `True`, `clean_url()` looks for auth tokens based on the default list of token names (provided below) and removes them. When `remove_auth` is set to a list of strings, the union of the user-provided list and the default list is used as the set of token names to remove."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "default_list = {\n",
    "    \"access_token\",\n",
    "    \"auth_key\",\n",
    "    \"auth\",\n",
    "    \"password\",\n",
    "    \"username\",\n",
    "    \"login\",\n",
    "    \"token\",\n",
    "    \"passcode\",\n",
    "    \"access-token\",\n",
    "    \"auth-key\",\n",
    "    \"authentication\",\n",
    "    \"authentication-key\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the same DataFrame under the two scenarios described above, focusing on the second row."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### a. `remove_auth = True` (boolean)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_auth_boolean = clean_url(df, column=\"url\", remove_auth=True)\n",
    "df_remove_auth_boolean[\"url_details\"][1] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The queries `auth` and `token` were removed from the result, but `not_token` and `another_token` were kept, because only `auth` and `token` appear in `default_list`. Also notice the **additional line** in the report giving the stats on how many queries were removed from how many rows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b. `remove_auth` = list of strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_auth_list = clean_url(df, column=\"url\", remove_auth=[\"another_token\"])\n",
    "df_remove_auth_list[\"url_details\"][1] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The queries `auth`, `token` and `another_token` were removed, but `not_token` was kept in the result. This is because queries are removed based on the union of `default_list` and the user-defined list."
   ]
  },
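  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The union behaviour described above can be mimicked with plain Python sets and the standard `urllib.parse` module. This is a hedged sketch of the idea only, not dataprep's internal code. The names `default_tokens`, `user_tokens`, `blocked` and `kept` are hypothetical and exist only in this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlsplit, parse_qsl\n",
    "\n",
    "# Illustration only: same names as the default list shown earlier\n",
    "default_tokens = {\n",
    "    \"access_token\", \"auth_key\", \"auth\", \"password\", \"username\", \"login\",\n",
    "    \"token\", \"passcode\", \"access-token\", \"auth-key\", \"authentication\",\n",
    "    \"authentication-key\",\n",
    "}\n",
    "user_tokens = [\"another_token\"]\n",
    "\n",
    "# Union of the default names and the user-provided names\n",
    "blocked = default_tokens | set(user_tokens)\n",
    "\n",
    "sample_url = \"http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&not_token=hiThere&another_token=12323423\"\n",
    "queries = dict(parse_qsl(urlsplit(sample_url).query))\n",
    "\n",
    "# Keep only the queries whose key is not in the combined set\n",
    "kept = {key: value for key, value in queries.items() if key not in blocked}\n",
    "kept"
   ]
  },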
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `split` parameter\n",
    "The `split` parameter adds individual columns containing all the extracted features to the given DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_split = clean_url(df, column=\"url\", split=True)\n",
    "df_remove_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `inplace` parameter\n",
    "Setting `inplace = True` replaces the original column with `original_column_name`_details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_inplace = clean_url(df, column=\"url\", inplace=True)\n",
    "df_remove_inplace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. `split` and `inplace`\n",
    "When both `split` and `inplace` are `True`, the original column is replaced with the individual columns created by `split`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_inplace_split = clean_url(df, column=\"url\", inplace=True, split=True)\n",
    "df_remove_inplace_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. `errors` parameter\n",
    "The `errors` parameter controls how values that cannot be parsed are handled:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### a. \"coerce\" (default)\n",
    "This is the default value of the `errors` parameter. It sets values that cannot be parsed to NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_errors_default = clean_url(df, column=\"url\")\n",
    "df_remove_errors_default"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b. \"ignore\"\n",
    "Values that cannot be parsed are returned unchanged."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_errors_ignore = clean_url(df, column=\"url\", errors=\"ignore\")\n",
    "df_remove_errors_ignore "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### c. \"raise\"\n",
    "This raises a `ValueError` when it encounters a value that cannot be parsed."
   ]
  },
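  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this behaviour, reusing the `df` defined at the top of this guide. The call is wrapped in `try`/`except` so that the notebook keeps running. It assumes the exception is a `ValueError`, as described above. The exact error message is not shown here because it depends on the library version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_url\n",
    "\n",
    "# df contains several values that are not URLs, so parsing is expected to fail.\n",
    "# Assumption: the exception type is ValueError, as stated in the text above.\n",
    "try:\n",
    "    clean_url(df, column=\"url\", errors=\"raise\")\n",
    "except ValueError as err:\n",
    "    print(f\"ValueError: {err}\")"
   ]
  },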
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. `report` parameter\n",
    "`report` is `True` by default. When set to `False`, the stats describing the cleaning operations are not displayed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_remove_auth_boolean = clean_url(df, column=\"url\", remove_auth=True, report=False)\n",
    "df_remove_auth_boolean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. `validate_url()`\n",
    "`validate_url()` returns True when the input is a valid URL. Otherwise it returns False."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_url\n",
    "print(validate_url({\"not_a_url\" : True}))\n",
    "print(validate_url(2346789))\n",
    "print(validate_url(\"https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur\"))\n",
    "print(validate_url(\"http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423\"))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "    \"url\": [\n",
    "        \"random text which is not a url\",\n",
    "        \"http://www.facebookee.com/otherpath?auth=facebookeeauth&token=iwusdkc&nottoken=hiThere&another_token=12323423\",\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1234&loc=van\",\n",
    "        \"notaurl\",\n",
    "        np.nan,\n",
    "        None,\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken2&studentid=1230&loc=bur\",\n",
    "        \"\",\n",
    "        {\n",
    "            \"not_a_url\": True\n",
    "        },\n",
    "        \"2345678\",\n",
    "        345345345,\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken3&studentid=1231&loc=sur\",\n",
    "        \"https://www.sfu.ca/ficticiouspath?auth=sampletoken1&studentid=1232&loc=van\",\n",
    "    ]\n",
    "})\n",
    "\n",
    "df[\"validate_url\"] = validate_url(df[\"url\"])\n",
    "df"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
